IN THE SPECIFICATION:

Please substitute the following specification for the original specification filed in the
application. A clean version (without markings) of the substitute specification is also provided.
Applicants hereby assert that no new matter is being added. Rather, the changes made herein are
to comply with the suggestions of the Examiner (in the Office Action of March 25, 2003) and in
order to correct typographical errors found in the original specification.

Substitute Specification (clean version without markings)
FEATURE WEIGHTING IN K-MEANS CLUSTERING

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to data clustering and, in particular, concerns a
method and system for providing a framework for integrating multiple, heterogeneous feature
spaces in a k-means clustering algorithm.

Description of the Related Art

Clustering, the grouping together of similar data points in a data set, is a widely used
procedure for analyzing data for data mining applications. Such applications of clustering
include unsupervised classification and taxonomy generation, nearest-neighbor searching,
scientific discovery, vector quantization, text analysis and navigation, data reduction and
summarization, supermarket database analysis, customer/market segmentation, and time series
analysis.

One of the more popular techniques for clustering data of a set of data records includes
partitioning operations (also referred to as finding pattern vectors) of the data using a k-means
clustering algorithm, which generates a minimum variance grouping of data by minimizing the
sum of squared Euclidean distances from cluster centroids. The popularity of the k-means
clustering algorithm is based on its ease of interpretation, simplicity of use, scalability, speed of
convergence, parallelizability, adaptability to sparse data, and ease of out-of-core use.

The k-means clustering algorithm functions to reduce data. Initial cluster centers are
chosen arbitrarily. Records from the database are then distributed among the chosen cluster
domains based on minimum distances. After records are distributed, the cluster centers are
updated to reflect the means of all the records in the respective cluster domains. This process is
iterated so long as the cluster centers continue to move, and stops once they converge and remain
static. Performance of this algorithm is influenced by the number and location of the initial cluster
centers, and by the order in which pattern samples are passed through the program.

Initial use of the k-means clustering algorithm typically requires a user or an external
algorithm to define the number of clusters. Second, all the data points within the data set are
loaded into the function. Preferably, the data points are indexed according to a numeric field
value and a record number. Third, a cluster center is initialized for each of the predefined number
of clusters. Each cluster center contains a random normalized value for each field within the
cluster. Thus, initial centers are typically randomly defined. Alternatively, initial cluster center
values may be predetermined based on equal divisions of the range within a field. In a fourth
step, a routine is performed for each of the records in the database. For each record number from
one to the current record number, the cluster center closest to the current record is determined.
The record is then assigned to that closest cluster by adding the record number to the list of
records previously assigned to the cluster. In a fifth step, after all of the records have been
assigned to a cluster, the cluster center for each cluster is adjusted to reflect the averages of data
values contained in the records assigned to the cluster. The steps of assigning records to clusters
and then adjusting the cluster centers are repeated until the cluster centers move less than a
predetermined epsilon value. At this point the cluster centers are viewed as being static.
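
By way of illustration only, the conventional procedure described above can be sketched in Python as follows. The function name `kmeans`, its parameters (`epsilon`, `max_iter`, `seed`), and the choice of seeding the centers with k distinct records are assumptions made for this sketch; they do not appear in the specification.

```python
import random


def kmeans(records, k, epsilon=1e-6, max_iter=100, seed=0):
    """Classical k-means on equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Third step: initialize one center per cluster (assumed: k distinct records).
    centers = [list(r) for r in rng.sample(records, k)]
    assignment = [0] * len(records)
    for _ in range(max_iter):
        # Fourth step: assign every record to its closest center
        # under the squared Euclidean distance.
        for i, rec in enumerate(records):
            assignment[i] = min(
                range(k),
                key=lambda u: sum((a - b) ** 2 for a, b in zip(rec, centers[u])),
            )
        # Fifth step: move each center to the mean of its assigned records.
        shift = 0.0
        for u in range(k):
            members = [records[i] for i, a in enumerate(assignment) if a == u]
            if not members:
                continue  # an empty cluster keeps its previous center
            new_center = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_center, centers[u])))
            centers[u] = new_center
        # The centers are viewed as static once none moved by epsilon or more.
        if shift < epsilon:
            break
    return centers, assignment
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)` groups the first two records together and the last two together.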

A fundamental starting point for machine learning, multivariate statistics, or data mining
is where a data record can be represented as a high-dimensional feature vector. In many
traditional applications, all of the features are essentially of the same type. However, many
emerging data sets often have many different feature spaces, for example:

• Image indexing and searching systems use at least four different types of features: color,
texture, shape, and location.

• Hypertext documents contain at least three different types of features: the words, the
out-links, and the in-links.

• XML has become a standard way to represent data records; such records may have a
number of different textual, referential, graphical, numerical, and categorical features.

• The profile of a typical on-line customer, such as an Amazon.com customer, may contain
purchased books, music, DVD/video, software, toys, etc.

These above examples illustrate that data sets with multiple, heterogeneous features are indeed
natural and common. In addition, many data sets on the University of California Irvine Machine
Learning and Knowledge Discovery and Data Mining repositories contain data records with
heterogeneous features. Data clustering is an unsupervised learning operation whose output
provides fundamental techniques in machine learning and statistics. Statistical and computational
issues associated with the k-means clustering algorithm have been extensively studied for these
clustering operations. The same cannot be said, however, for another key ingredient for
multidimensional data analysis: clustering data records having multiple, heterogeneous feature
spaces.

SUMMARY OF THE INVENTION

The invention provides a method and system for integrating multiple, heterogeneous
feature spaces in a k-means clustering algorithm. The method of the invention adaptively selects
the relative weights assigned to various feature spaces, which simultaneously attains good
separation along all of the feature spaces.

The invention integrates multiple feature spaces in a k-means clustering algorithm by
assigning different relative weights to these various feature spaces. Optimal feature weights that
can be incorporated with this algorithm are also determined; they lead to a clustering that
simultaneously minimizes the average intra-cluster dispersion and maximizes the average
inter-cluster dispersion along all the feature spaces.

DESCRIPTION OF THE DRAWINGS 

The foregoing will be better understood from the following detailed description of
preferred embodiments of the invention with reference to the drawings, in which:

FIGs. 1A and 1B show a data computing system and method of the invention,
respectively;

FIGs. 2A, 2B, 2C, and 2D show graphical results of an example using the invention;

FIG. 3 shows the feasible weights for the second exemplary use of the invention wherein,
when the number of feature spaces is 3, the feasible region is the triangular region formed by the
intersection of the plane $\alpha_1 + \alpha_2 + \alpha_3 = 1$ with the nonnegative orthant of $\mathbb{R}^3$;

FIG. 4 shows, for a newsgroups data set, a plot of macro-p versus the objective function
$Q_1 \times Q_2 \times Q_3$ for various different weight tuples; and

FIG. 5 is a flowchart illustrating a preferred method of the invention.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS OF THE INVENTION 

1. Introduction

While the invention is primarily disclosed as a method, it will be understood by a person
of ordinary skill in the art that an apparatus, such as a conventional data processor, including a
CPU, memory, I/O, program storage, a connecting bus, and other appropriate components, could
be programmed or otherwise designed to facilitate the practice of the method of the invention.
Such a processor would include appropriate program means for executing the method of the
invention. Also, an article of manufacture, such as a pre-recorded disk or other similar computer
program product, for use with a data processing system, could include a storage medium and
program means recorded thereon for directing the data processing system to facilitate the
practice of the method of the invention. It will be understood that such apparatus and articles of
manufacture also fall within the spirit and scope of the invention.

FIG. 1A shows an exemplary data processing system for practicing the disclosed feature
weighted k-means data clustering analysis methodology that includes a computing device in the
form of a conventional computer 20, including one or more processing units 21, a system
memory 22, and a system bus 23 that couples various system components including the system
memory to the processing unit 21. The system bus 23 may be any of several types of bus
structures including a memory bus or memory controller, a peripheral bus, and a local bus using
any of a variety of bus architectures.

The system memory includes read only memory (ROM) 24 and random access memory
(RAM) 25. A basic input/output system 26 (BIOS), containing the basic routine that helps to
transfer information between elements within the computer 20, such as during start-up, is stored
in ROM 24.

The computer 20 further includes a hard disk drive 27 for reading from and writing to a
hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable
magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical
disk 31 such as a CD-ROM or other optical media. The hard disk drive 27, magnetic disk drive
28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface
32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives
and their associated computer-readable media provide nonvolatile storage of computer readable
instructions, data structures, program modules and other data for the computer 20. Although the
exemplary environment described herein employs a hard disk, a removable magnetic disk 29,
and a removable optical disk 31, it should be appreciated by those skilled in the art that other
types of computer readable media which can store data that is accessible by a computer, such as
magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access
memories (RAMs), read only memories (ROMs), and the like, may also be used in the
exemplary operating environment. Data and program instructions can be in the storage area that
is readable by a machine, and that tangibly embodies a program of instructions executable by the
machine for performing the method of the present invention described herein for data mining
applications.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical
disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application
programs 36, other program modules 37, and program data 38. A user may enter commands and
information into the computer 20 through input devices such as a keyboard 40 and pointing
device 42. Other input devices (not shown) may include a microphone, joystick, game pad,
satellite dish, scanner, or the like. These and other input devices are often connected to the
processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be
connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
A monitor 47 or other type of display device is also connected to the system bus 23 via an
interface, such as a video adapter 48. In addition to the monitor, personal computers typically
include other peripheral output devices (not shown), such as speakers and printers. The
computer 20 may operate in a networked environment using logical connections to one or more
remote computers, such as a remote computer 49. The remote computer 49 may be another
personal computer, a server, a router, a network PC, a peer device or other common network
node, and typically includes many or all of the elements described above relative to the computer
20, although only a memory storage device 50 has been illustrated in FIG. 1A. The logical
connections depicted in FIG. 1A include a local area network (LAN) 51 and a wide area network
(WAN) 52. Such networking environments are commonplace in offices, enterprise-wide
computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 20 is connected to the local
network 51 through a network interface or adapter 53. When used in a WAN networking
environment, the computer 20 typically includes a modem 54 or other means for establishing
communications over the wide area network 52, such as the Internet. The modem 54, which may
be internal or external, is connected to the system bus 23 via the serial port interface 46. In a
networked environment, program modules depicted relative to the computer 20, or portions
thereof, may be stored in the remote memory storage device. It will be appreciated that the
network connections shown are exemplary and other means of establishing a communications
link between the computers may be used.

The method of the invention, as shown in general form in FIG. 1B, may be implemented
using standard programming and/or engineering techniques using computer programming
software, firmware, hardware or any combination or sub-combination thereof. Any such
resulting program(s), having computer readable program code means, may be embodied or
provided within one or more computer readable or usable media such as fixed (hard) drives, disk,
diskettes, optical disks, magnetic tape, semiconductor memories such as read-only memory
(ROM), etc., or any transmitting/receiving medium such as the Internet or other communication
network or link, thereby making a computer program product, i.e., an article of manufacture,
according to the invention. The article of manufacture containing the computer programming
code may be made and/or used by executing the code directly from one medium, by copying the
code from one medium to another medium, or by transmitting the code over a network.

The computing system for implementing the method of the invention can be in the form
of software, firmware, hardware or any combination or sub-combination thereof, which embody
the invention. One skilled in the art of computer science will easily be able to combine the
software created as described with appropriate general purpose or special purpose computer
hardware to create a computer system and/or computer subcomponents embodying the invention
and to create a computer system and/or computer subcomponents for carrying out the method of
the invention.

The method of the invention is for clustering data, by establishing a starting point at step
1 as shown in FIG. 1B, wherein a data set having m feature spaces is given, and each data object
(record) is represented as a tuple of m feature vectors. To cluster, a measure of distortion
between two data records is needed. Since different types of features may have radically
different statistical distributions, in general, it is unnatural to disregard fundamental differences
between various different types of features and to impose a uniform, un-weighted distortion
measure across disparate feature spaces. In Section 2 below, a distortion between two data
records is provided as a weighted sum of suitable distortion measures on individual component
feature vectors, where the distortions on individual components are allowed to be different.
In Section 3 below, using a convex programming formulation, the classical Euclidean k-means
algorithm is generalized to use the weighted distortion measure. In Section 4 below, optimal
feature weights are selected that lead to a clustering that simultaneously minimizes the average
intra-cluster dispersion and maximizes the average inter-cluster dispersion along all the feature
spaces. In Section 5, an outline evaluation strategy is provided. In Sections 6 and 7, two
exemplary uses of the invention are provided for a) clustering data sets with numerical and
categorical features; and b) clustering text data sets with words, 2-phrases, and 3-phrases,
respectively. Using data sets with a known ground truth classification, it is empirically
demonstrated that the clusterings that correspond to the optimal feature weights deliver nearly
optimal precision/recall performance.

Feature weighting may be thought of as a generalization of feature selection where the
latter restricts attention to weights that are either 1 (retain the feature) or 0 (eliminate the
feature); see Wettschereck et al., "A review and empirical evaluation of feature weighting
methods for a class of lazy learning algorithms," Artificial Intelligence Review, Vol. 11,
pp. 273-314, 1997. Feature selection in the context of supervised learning has a long
history in machine learning; see, for example, Blum et al., "Selection of relevant features and
examples in machine learning," Artificial Intelligence, Vol. 97, pp. 245-271, 1997. Feature
selection in the context of unsupervised learning has only recently been systematically studied.

2. Data Model and a Distortion Measure 

2.1 Data Model: Assume that a set of data records is given, where each object is a tuple of m
component feature vectors. A typical data object is written as $x = (F_1, F_2, \ldots, F_m)$, where
the l-th component feature vector $F_l$, $1 \le l \le m$, is to be thought of as a column vector and lies
in some (abstract) feature space $\mathcal{F}_l$. The data object x lies in the m-fold product feature space
$\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_m$. The feature spaces $\{\mathcal{F}_l\}_{l=1}^{m}$ can be dimensionally
different and possess different topologies; hence, the data model accommodates heterogeneous
types of features. Two examples of feature spaces are:

Euclidean Case: $\mathcal{F}_l$ is either $\mathbb{R}^{f_l}$, $f_l \ge 1$, or some compact submanifold thereof.

Spherical Case: $\mathcal{F}_l$ is the intersection of the $f_l$-dimensional, $f_l \ge 1$, unit sphere with the
non-negative orthant of $\mathbb{R}^{f_l}$.

2.2 A Weighted Distortion Measure: To measure the distortion between two given
data records $x = (F_1, F_2, \ldots, F_m)$ and $\tilde{x} = (\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_m)$, for $1 \le l \le m$, let $D_l$
denote a distortion measure between the corresponding component feature vectors $F_l$ and
$\tilde{F}_l$. Mathematically, only two properties of the distortion function are needed:

• $D_l : \mathcal{F}_l \times \mathcal{F}_l \to [0, \infty)$.

• For a fixed $F_l$, $D_l$ is convex in $\tilde{F}_l$.

Euclidean Case: The squared-Euclidean distance

$$D_l(F_l, \tilde{F}_l) = (F_l - \tilde{F}_l)^T (F_l - \tilde{F}_l)$$

trivially satisfies the non-negativity and, for $\lambda \in [0,1]$, the convexity follows from:

$$D_l\big(F_l, \lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\big) \le \lambda D_l(F_l, \tilde{F}_l') + (1-\lambda) D_l(F_l, \tilde{F}_l'').$$

Spherical Case: The cosine distance $D_l(F_l, \tilde{F}_l) = 1 - F_l^T \tilde{F}_l$ trivially satisfies the
non-negativity and, for $\lambda \in [0,1]$, the convexity follows from:

$$D_l\left(F_l, \frac{\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''}{\|\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\|}\right) \le \lambda D_l(F_l, \tilde{F}_l') + (1-\lambda) D_l(F_l, \tilde{F}_l''),$$

where $\|\cdot\|$ denotes the Euclidean norm. The division by $\|\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\|$ ensures that
the second argument of $D_l$ is a unit vector. Geometrically, the convexity is defined along the
geodesic arc connecting the two unit vectors $\tilde{F}_l'$ and $\tilde{F}_l''$, and not along the chord connecting
the two. Given m valid distortion measures $\{D_l\}_{l=1}^{m}$ between the corresponding m component
feature vectors of x and $\tilde{x}$, a weighted distortion measure between x and $\tilde{x}$ is defined as:

$$D^{\alpha}(x, \tilde{x}) = \sum_{l=1}^{m} \alpha_l D_l(F_l, \tilde{F}_l),$$

where the feature weights $\{\alpha_l\}_{l=1}^{m}$ are non-negative and sum to 1, and
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$.

The weighted distortion $D^{\alpha}$ is a convex combination of convex distortion measures, and hence,
for a fixed x, $D^{\alpha}$ is convex in $\tilde{x}$. The feature weights $\{\alpha_l\}_{l=1}^{m}$ are enabled in the method,
and are used to assign different relative importance to component feature vectors. In Section 4
below, an appropriate choice of these parameters is made.
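
As a purely illustrative sketch, the weighted distortion of Section 2.2 can be written in Python as below. The function names and the representation of a data record as a tuple of component vectors are assumptions of this sketch, not part of the specification; the cosine distance additionally assumes that both arguments are already unit vectors.

```python
def squared_euclidean(f, g):
    """Squared-Euclidean distance between two component feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f, g))


def cosine_distance(f, g):
    """Cosine distance 1 - f.g; assumes f and g are unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(f, g))


def weighted_distortion(x, y, weights, measures):
    """Weighted sum of per-component distortions between records x and y.

    x, y:     tuples of m component feature vectors
    weights:  m non-negative feature weights summing to 1
    measures: m distortion functions, one per feature space
    """
    return sum(w * d(fx, fy) for w, d, fx, fy in zip(weights, measures, x, y))
```

For instance, with `x = ((1.0, 2.0), (1.0, 0.0))`, `y = ((1.0, 0.0), (0.0, 1.0))`, weights `(0.25, 0.75)` and measures `(squared_euclidean, cosine_distance)`, the component distortions are 4 and 1, so the weighted distortion is 0.25 × 4 + 0.75 × 1 = 1.75.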

3. k-Means with Weighted Distortion

3.1 The Problem: Suppose that n data records are given such that

$$x_i = \big(F_{(i,1)}, F_{(i,2)}, \ldots, F_{(i,m)}\big), \quad 1 \le i \le n,$$

where the l-th, $1 \le l \le m$, component feature vector of every data record is in the
feature space $\mathcal{F}_l$. A partitioning of the data set $\{x_i\}_{i=1}^{n}$ into k disjoint clusters
$\{\pi_u\}_{u=1}^{k}$ is sought.

3.2 Generalized Centroids: Given a partitioning $\{\pi_u\}_{u=1}^{k}$, for each partition $\pi_u$,
write the corresponding generalized centroid as

$$c_u = \big(c_{(u,1)}, c_{(u,2)}, \ldots, c_{(u,m)}\big),$$


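where, for $1 \le l \le m$, the l-th component $c_{(u,l)}$ is in $\mathcal{F}_l$. The generalized centroid $c_u$ is
defined as the solution of the following convex programming problem:

$$c_u = \arg\min_{\tilde{x} \in \mathcal{F}} \Big( \sum_{x \in \pi_u} D^{\alpha}(x, \tilde{x}) \Big). \qquad (1)$$

In an empirical average sense, the generalized centroid may be thought of as being the closest in
$D^{\alpha}$ to all the data records in the cluster $\pi_u$. The key to solving (1) is to observe that $D^{\alpha}$ is
component-wise convex, and, hence, equation (1) can be solved by separately solving for each of
its m components $c_{(u,l)}$, $1 \le l \le m$. In other words, the following m convex programming
problems are solved:

$$c_{(u,l)} = \arg\min_{\tilde{F}_l \in \mathcal{F}_l} \Big( \sum_{x \in \pi_u} D_l(F_l, \tilde{F}_l) \Big). \qquad (2)$$

For the two feature spaces of interest (others as well), the solution of equation (2) can be written
in closed form for the Euclidean and Spherical cases, respectively:

$$c_{(u,l)} = \frac{1}{|\pi_u|} \sum_{x \in \pi_u} F_l \quad \text{(Euclidean)}, \qquad c_{(u,l)} = \frac{\sum_{x \in \pi_u} F_l}{\big\| \sum_{x \in \pi_u} F_l \big\|} \quad \text{(Spherical)},$$

where $x = (F_1, F_2, \ldots, F_m)$.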
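
A minimal Python sketch of the component-wise solution is given below for illustration. It assumes the usual closed forms, namely the arithmetic mean for a Euclidean component and the normalized sum for a Spherical component; the function names and the `kinds` labels are hypothetical and chosen only for this sketch.

```python
import math


def euclidean_centroid(vectors):
    """Mean of the component vectors (assumed closed form for the Euclidean case)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def spherical_centroid(vectors):
    """Sum of the component vectors rescaled to unit norm (Spherical case).

    Assumes the sum is non-zero, which holds for unit vectors in the
    non-negative orthant.
    """
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total]


def generalized_centroid(cluster, kinds):
    """Solve each of the m components separately.

    cluster: non-empty list of records, each a tuple of m component vectors
    kinds:   m labels, each either "euclidean" or "spherical"
    """
    centroid = []
    for l, kind in enumerate(kinds):
        component = [record[l] for record in cluster]
        if kind == "euclidean":
            centroid.append(euclidean_centroid(component))
        else:
            centroid.append(spherical_centroid(component))
    return tuple(centroid)
```

Because the weighted distortion separates over components, the feature weights do not enter the centroid computation in this sketch.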
3.3 The Method: Referring to FIG. 1B, the method of the invention uses the formulation of
equation (1) using the steps below, wherein the distortion of each individual cluster
$\pi_u$, $1 \le u \le k$, is measured as:

$$\sum_{x \in \pi_u} D^{\alpha}(x, c_u),$$

and the quality of the entire partitioning as the combined distortion of all the k
clusters: $\sum_{u=1}^{k} \sum_{x \in \pi_u} D^{\alpha}(x, c_u)$. What is sought is k disjoint clusters such that
equation (3) as follows is minimized, wherein these k disjoint clusters are
$\pi_1^{\dagger}, \pi_2^{\dagger}, \ldots, \pi_k^{\dagger}$:

$$\{\pi_u^{\dagger}\}_{u=1}^{k} = \arg\min_{\{\pi_u\}_{u=1}^{k}} \Big( \sum_{u=1}^{k} \sum_{x \in \pi_u} D^{\alpha}(x, c_u) \Big), \qquad (3)$$

where the feature weights $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ are fixed. When only one of the weights
$\{\alpha_l\}_{l=1}^{m}$ is nonzero, the minimization problem (3) is known to be NP-complete, meaning no
known algorithm exists for solving the problem in polynomial time. k-means is used, which is
an efficient and effective data clustering algorithm. Moreover, k-means can be thought of as a
gradient descent method, and, hence, never increases the objective function and eventually
converges to a local minimum.

In FIG. 1B, an overview of the processing components is shown for clustering data.
These processing components perform data clustering on data records stored on a storage
medium such as the computer system's hard disk drive 27. The data records typically are made
up of a number of data fields or attributes. Examples of such data records are discussed below in
the two examples implementing the invention.

The components that perform the clustering require three inputs: the number of clusters K,
a set of K initial starting points, and the data records to be clustered. The clustering of data by
these components produces a final solution as shown in step 5 as an output. Each of the K
clusters of this final solution is represented by its mean (centroid), where each mean has d
components equal to the number of attributes of the data records and a fixed feature weight of
the m feature spaces.

A refinement of feature weights in step 4 below produces better clustering from the data
records to be clustered using the methodology of the invention. A most favorable refined starting
point, produced using a good approximation of an initial starting point, is discussed below; it
would move the set of starting points closer to the modes of the data distribution.
At Step 1: An initial point with an arbitrary partitioning of the data records to be
evaluated is provided, wherein the partitioning is $\{\pi_u^{(0)}\}_{u=1}^{k}$. Let $\{c_u^{(0)}\}_{u=1}^{k}$ denote the
generalized centroids associated with the given partitioning. Set the index of iteration t = 0. A
choice of the initial partitioning is quite crucial to finding a good local minimum; to achieve this,
see a method for doing this technique as taught in U.S. Patent 6,115,708, hereby incorporated by
reference.

At Step 2: For each data record $x_i$, $1 \le i \le n$, find the generalized centroid that is closest
to $x_i$. Now, for $1 \le u \le k$, compute the new partitioning $\{\pi_u^{(t+1)}\}_{u=1}^{k}$ induced by the old
generalized centroids $\{c_u^{(t)}\}_{u=1}^{k}$:

$$\pi_u^{(t+1)} = \Big\{ x \in \{x_i\}_{i=1}^{n} : D^{\alpha}\big(x, c_u^{(t)}\big) \le D^{\alpha}\big(x, c_v^{(t)}\big),\ 1 \le v \le k \Big\}. \qquad (4)$$

In words, $\pi_u^{(t+1)}$ is the set of all data records that are closest to the generalized centroid $c_u^{(t)}$.
If some data object is simultaneously closest to more than one generalized centroid, then it is
randomly assigned to one of the clusters. Clusters defined using equation (4) are known as
Voronoi or Dirichlet partitions.

Step 3: Compute the new generalized centroids $\{c_u^{(t+1)}\}_{u=1}^{k}$
corresponding to the partitioning computed in equation (4) by using equations (1)-(2) where,
instead of $\pi_u$, the following is used: $\pi_u^{(t+1)}$.

Step 4: Optionally refine the feature weights and store them in memory using the method
discussed in Section 4 below.

Step 5: If some stopping criterion is met, then set $\pi_u^{\dagger} = \pi_u^{(t+1)}$ and $c_u^{\dagger} = c_u^{(t+1)}$ for
$1 \le u \le k$ and exit at step 5 with the final clustering solution. Otherwise, increment t by 1, go to
step 2 above, and repeat the process. An example of a stopping criterion is: stop if the change in
the objective function as defined in equation (6) below, between two successive iterations, is less
than a specified threshold, for example, when the generalized centroids move by less than a
small floating point number.
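
Steps 1 through 5 can be sketched in Python as follows, for illustration only. The sketch assumes two kinds of feature space (Euclidean with squared-Euclidean distance, Spherical with cosine distance), assumes every cluster of the initial partitioning is non-empty, omits the optional weight refinement of Step 4, and, as a simplifying assumption, monitors the combined distortion of the partitioning as its stopping quantity. All function and parameter names are hypothetical.

```python
import math
import random


def distortion(x, c, alpha, kinds):
    """Weighted distortion between a record x and a generalized centroid c."""
    total = 0.0
    for a, kind, f, g in zip(alpha, kinds, x, c):
        if kind == "euclidean":
            total += a * sum((p - q) ** 2 for p, q in zip(f, g))
        else:  # "spherical": cosine distance between unit vectors
            total += a * (1.0 - sum(p * q for p, q in zip(f, g)))
    return total


def centroid(cluster, kinds):
    """Component-wise centroid: mean (Euclidean) or normalized sum (Spherical)."""
    parts = []
    for l, kind in enumerate(kinds):
        total = [sum(col) for col in zip(*(x[l] for x in cluster))]
        if kind == "euclidean":
            scale = len(cluster)
        else:
            scale = math.sqrt(sum(t * t for t in total))
        parts.append([t / scale for t in total])
    return tuple(parts)


def weighted_kmeans(records, labels, k, alpha, kinds, tol=1e-9, max_iter=100, seed=0):
    rng = random.Random(seed)
    labels = list(labels)  # Step 1: initial partitioning, given as cluster labels
    centroids = [
        centroid([x for x, u in zip(records, labels) if u == v], kinds)
        for v in range(k)
    ]
    previous = math.inf
    objective = previous
    for _ in range(max_iter):
        # Step 2: assign each record to its closest centroid, ties broken randomly.
        for i, x in enumerate(records):
            dists = [distortion(x, c, alpha, kinds) for c in centroids]
            best = min(dists)
            labels[i] = rng.choice([u for u, d in enumerate(dists) if d == best])
        # Step 3: recompute centroids (an empty cluster keeps its old centroid).
        for v in range(k):
            members = [x for x, u in zip(records, labels) if u == v]
            if members:
                centroids[v] = centroid(members, kinds)
        # Step 5: stop when the combined distortion no longer decreases by tol.
        objective = sum(
            distortion(x, centroids[u], alpha, kinds) for x, u in zip(records, labels)
        )
        if previous - objective < tol:
            break
        previous = objective
    return labels, centroids, objective
```

Running the sketch on four records whose first component is a one-dimensional numerical vector and whose second component is a two-dimensional unit vector, starting from the deliberately poor initial labels `[0, 1, 0, 1]`, recovers the two natural groups after one reassignment.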
4. Choice of the Feature Weights: Throughout this section, fix the number of clusters $k \ge 2$
and fix the initial partitioning used by the k-means algorithm in Step 1 above. Let:

$$\Delta = \Big\{ \alpha : \sum_{l=1}^{m} \alpha_l = 1,\ \alpha_l \ge 0,\ 1 \le l \le m \Big\}$$

denote the set of all possible feature weights. Given a feature weight $\alpha \in \Delta$, let

$$\Pi(\alpha) = \{\pi_u^{\dagger}(\alpha)\}_{u=1}^{k}$$

denote the partitioning obtained by running the k-means algorithm with the fixed initial
partitioning and the given feature weight. In principle, the k-means algorithm is run for every
possible feature weight in $\Delta$. From the set of all possible clusterings
$\{\Pi(\alpha) \mid \alpha \in \Delta\}$, select a clustering that is in some sense the best, by introducing a
figure-of-merit to compare various clusterings. Fix a partitioning $\Pi(\alpha)$. By focusing on how
well this partitioning clusters along the l-th, $1 \le l \le m$, component feature vector, the average
within-cluster distortion and the average between-cluster distortion along the l-th component
feature vector can be defined, respectively, as:

$$\Gamma_l(\alpha) = \sum_{u=1}^{k} \sum_{x \in \pi_u^{\dagger}} D_l\big(F_l, c^{\dagger}_{(u,l)}\big),$$

$$\Lambda_l(\alpha) = \frac{1}{k-1} \sum_{u=1}^{k} \sum_{x \in \pi_u^{\dagger}} \sum_{v=1,\, v \ne u}^{k} D_l\big(F_l, c^{\dagger}_{(v,l)}\big),$$
where $x = (F_1, F_2, \ldots, F_m)$. It is desirable to minimize $\Gamma_l(\alpha)$ and to maximize $\Lambda_l(\alpha)$ (i.e.,
coherent clusters that are well-separated from each other are desirable). Hence, minimize:

$$Q_l(\alpha) = \left( \frac{\Gamma_l(\alpha)}{\Lambda_l(\alpha)} \right)^{n_l / n}, \qquad (5)$$

where $n_l$ denotes the number of data records that have a non-zero feature vector along the l-th
component. The quantity $n_l$ is introduced to accommodate sparse data sets. If the l-th feature
space is simply $\mathbb{R}^{f_l}$ and $D_l$ is the squared-Euclidean distance, then $\Gamma_l(\alpha)$ is simply the trace of
the within-class covariance matrix and $\Lambda_l(\alpha)$ is the trace of the between-class covariance
matrix. In this case,

$$\frac{\Gamma_l(\alpha)}{\Lambda_l(\alpha)}$$

is the ratio used in determining the quality of a given classification, and serves as the objective
function underlying multiple discriminant analysis. Minimizing $Q_l(\alpha)$ leads to a good
discrimination along the l-th component feature space. Since it is desirable to simultaneously
attain good discrimination along all the m feature spaces, the optimal feature weights
$\alpha^{\dagger}$ are selected as:

$$\alpha^{\dagger} = \arg\min_{\alpha \in \Delta} \Big[ \prod_{l=1}^{m} Q_l(\alpha) \Big]. \qquad (6)$$
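
The selection of feature weights can be illustrated with the short Python sketch below. It rests on several assumptions that are not stated in the text: each per-space figure of merit is taken to be the within-to-between distortion ratio raised to the power n_l / n, the continuous set of feasible weights is approximated by a finite grid for m = 2, and `evaluate` is a hypothetical callback that runs the clustering for a given weight tuple and returns the objective value.

```python
def q_product(gammas, lambdas, counts, n):
    """Product over feature spaces of (Gamma_l / Lambda_l) ** (n_l / n).

    gammas:  within-cluster distortions, one per feature space
    lambdas: between-cluster distortions, one per feature space
    counts:  n_l, number of records with a non-zero l-th feature vector
    n:       total number of records
    """
    result = 1.0
    for gamma, lam, n_l in zip(gammas, lambdas, counts):
        result *= (gamma / lam) ** (n_l / n)
    return result


def simplex_grid_2(steps):
    """Evenly spaced weight tuples (a1, a2) with a1 + a2 = 1 and a1, a2 >= 0."""
    return [(i / steps, 1.0 - i / steps) for i in range(steps + 1)]


def select_weights(candidates, evaluate):
    """Return the candidate weight tuple with the smallest objective value."""
    return min(candidates, key=evaluate)
```

A grid is used here only because it is simple to state; any other search over the feasible weights could be substituted for `simplex_grid_2`.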



5. Evaluating Method Effectiveness: Assume that the optimal weight tuple $\alpha^{\dagger}$ has been
selected by minimizing the objective function in (6). How good is the clustering
corresponding to the optimal feature weight tuple? To answer this, assume that pre-classified
data is given and benchmark the precision/recall performance of various clusterings against the
given ground truth. Precision/recall numbers measure the overlap between a given clustering
and the ground truth classification. These precision/recall numbers are not used in the selection of
the optimal weight tuple, and are intended only to provide a way of evaluating the utility of feature
weighting. By using the precision/recall numbers only to compare partitionings with a fixed
number of clusters k, that is, partitionings with the same model complexity, a measure of
effectiveness is provided.

To meaningfully define precision/recall, the clusterings are converted into classifications
using the following simple rule: identify each cluster with the class that has the
largest overlap with the cluster, and assign every element in that cluster to the found class. The
rule allows multiple clusters to be assigned to a single class, but never assigns a single cluster to
multiple classes. Suppose there are c classes $\{\omega_t\}_{t=1}^{c}$ in the ground truth classification. For a
given clustering, by using the above rule, let $a_t$ denote the number of data records that are
correctly assigned to the class $\omega_t$, let $b_t$ denote the data records that are incorrectly assigned
to the class $\omega_t$, and let $c_t$ denote the data records that are incorrectly rejected from the class $\omega_t$.

Precision and recall are defined as:

$$p_t = \frac{a_t}{a_t + b_t}, \qquad r_t = \frac{a_t}{a_t + c_t}, \qquad 1 \le t \le c.$$

The precision and recall are defined per class. Next, the performance averages across classes
using macro-precision (macro-p), macro-recall (macro-r), micro-precision (micro-p), and
micro-recall (micro-r) are captured by:

$$\text{macro-p} = \frac{1}{c} \sum_{t=1}^{c} p_t, \qquad \text{macro-r} = \frac{1}{c} \sum_{t=1}^{c} r_t,$$

$$\text{micro-p} \overset{(a)}{=} \text{micro-r} = \frac{1}{n} \sum_{t=1}^{c} a_t,$$

where (a) follows since, in this case, $\sum_{t=1}^{c} (a_t + b_t) = \sum_{t=1}^{c} (a_t + c_t) = n$.
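
The cluster-to-class rule and the averaged scores can be sketched in Python as follows, for illustration. The function names are hypothetical, and the convention of scoring a class as 0.0 when its denominator is zero is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter


def clusters_to_classes(cluster_ids, true_classes):
    """Assign every record the class that overlaps most with its cluster."""
    overlap = {}
    for u, t in zip(cluster_ids, true_classes):
        overlap.setdefault(u, Counter())[t] += 1
    mapping = {u: counts.most_common(1)[0][0] for u, counts in overlap.items()}
    return [mapping[u] for u in cluster_ids]


def macro_micro(predicted, true_classes):
    """Return (macro-p, macro-r, micro-p) over the ground-truth classes."""
    classes = sorted(set(true_classes))
    n = len(true_classes)
    pairs = list(zip(predicted, true_classes))
    precisions, recalls, correct = [], [], 0
    for t in classes:
        a = sum(1 for p, y in pairs if p == t and y == t)  # correctly assigned
        b = sum(1 for p, y in pairs if p == t and y != t)  # incorrectly assigned
        c = sum(1 for p, y in pairs if p != t and y == t)  # incorrectly rejected
        precisions.append(a / (a + b) if a + b else 0.0)
        recalls.append(a / (a + c) if a + c else 0.0)
        correct += a
    macro_p = sum(precisions) / len(classes)
    macro_r = sum(recalls) / len(classes)
    micro_p = correct / n  # equals micro-r, since every record gets one class
    return macro_p, macro_r, micro_p
```

With cluster labels `[0, 0, 0, 1, 1, 2]` and ground truth `["A", "A", "B", "B", "B", "A"]`, clusters 0 and 2 map to class A and cluster 1 maps to class B, so one record of class B is misclassified as A.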

6. Examples of Use of the Method of Clustering Data Sets with Numerical and
Categorical Attributes

6.1 Data Model: Suppose a data set with both numerical and categorical features is given.
Each numerical feature is linearly scaled, that is, the mean is subtracted and the result is divided
by the square-root of the variance. All linearly scaled numerical features are clubbed into one
feature space, and, for this feature vector, the squared-Euclidean distance is used. Each
q-ary categorical feature is represented using a 1-in-q representation, and all of the categorical
features are clubbed into a single feature space. Assuming no missing values, all the categorical
feature vectors have the same norm; only the direction of the categorical feature vectors is
retained, that is, each categorical feature vector is normalized to have a unit Euclidean norm, and
the cosine distance is used. Essentially, each data object x is represented as an m-tuple, m = 2, of
feature vectors $(F_1, F_2)$.
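
The construction of the two feature vectors can be sketched in Python as below, purely for illustration. The function names are hypothetical, the variance is assumed to be the population variance (division by the number of records), and every numerical column is assumed to be non-constant so that the scaling is defined.

```python
import math


def standardize(column):
    """Subtract the mean and divide by the square root of the variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def one_in_q(value, categories):
    """1-in-q representation of a q-ary categorical value."""
    return [1.0 if value == c else 0.0 for c in categories]


def build_records(numeric_rows, categorical_rows, category_sets):
    """Return one (F1, F2) pair per record.

    F1: linearly scaled numerical features (for squared-Euclidean distance)
    F2: concatenated 1-in-q codes, scaled to unit norm (for cosine distance)
    """
    columns = [standardize(list(col)) for col in zip(*numeric_rows)]
    records = []
    for i, cats in enumerate(categorical_rows):
        f1 = [col[i] for col in columns]
        f2 = []
        for value, categories in zip(cats, category_sets):
            f2.extend(one_in_q(value, categories))
        norm = math.sqrt(sum(v * v for v in f2))
        records.append((f1, [v / norm for v in f2]))
    return records
```

With no missing values, each categorical vector contains exactly one 1 per categorical feature, so every F2 has the same norm before scaling, as noted in the text.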

6.2 HEART and ADULT Data Sets: FIGs. 2A, 2B, 2C, and 2D show graphs of a first
example using the method of the invention using HEART (resp. ADULT) data, further defined
below. FIGs. 2A and 2B, respectively, each show a plot of the objective function $Q_1 \times Q_2$ in
equation (6) versus the weight $\alpha_1$. The vertical lines in FIGs. 2A, 2B, 2C, and 2D indicate the
position of the optimal weight tuples. For the HEART (resp. ADULT) data, FIGs. 2C and 2D,
respectively, each show a plot of macro-p (resp. micro-p, macro-p, and macro-r) versus the
weight $\alpha_1$. For the HEART data, macro-p, macro-r, and micro-p numbers are very close to each
other; thus, to avoid visual clutter, only macro-p numbers are plotted. For the ADULT
data, the top, the middle, and bottom plots in FIG. 2D are micro-p, macro-p, and macro-r.

The HEART data set consists of $n = 270$ instances, and can be obtained from the
STATLOG repository. Every instance consists of 7 numerical and 6 categorical features. The
data set has two classes: absence and presence of heart disease; 55.56% (resp. 44.44%) of the
instances were in the former (resp. latter) class. The ADULT data set consists of $n = 32561$
instances that were extracted from the 1994 Census database. Every instance consists of 6
numerical and 8 categorical features. The data set has two classes: those with income less
than or equal to $50,000 and those with income more than $50,000; 75.22% (resp. 24.78%) of the
instances were in the former (resp. latter) class.

6.3 The Optimal Weights: In this case, the set of feasible weights is

$$\Delta = \{(\alpha_1, \alpha_2) : \alpha_1 + \alpha_2 = 1,\ \alpha_1, \alpha_2 \ge 0\}.$$

The number of clusters $k = 8$ is selected, and a binary search on the weight
$\alpha_1 \in [0, 1]$ is done to minimize the objective function in equation (6).
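The specification does not detail the search procedure beyond naming it. As a rough sketch only, the Python code below shrinks the interval [0, 1] around a minimizer using a ternary search, which is a swapped-in technique that assumes the objective is unimodal in $\alpha_1$. The callable `objective` is a hypothetical stand-in for running the clustering with weights $(\alpha_1, 1 - \alpha_1)$ and evaluating equation (6); the toy objective used here is not equation (6).

```python
from typing import Callable, Tuple


def search_weight(
    objective: Callable[[float], float],
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-3,
) -> Tuple[float, float]:
    """Shrink [lo, hi] around a minimizer of a unimodal objective.

    Returns the weight tuple (alpha_1, alpha_2) with alpha_2 = 1 - alpha_1.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    alpha_1 = (lo + hi) / 2.0
    return alpha_1, 1.0 - alpha_1


def toy_objective(alpha_1: float) -> float:
    """Toy stand-in with its minimum at alpha_1 = 0.25 (NOT equation (6))."""
    return (alpha_1 - 0.25) ** 2 + 1.0


alpha = search_weight(toy_objective)
```

Each call to a real objective would involve a full clustering run, so the tolerance `tol` trades accuracy of the weight against the number of runs.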

For the HEART (resp. ADULT) data, FIG. 2A (resp. FIG. 2B) shows a plot of the
objective function $Q_1 \times Q_2$ in equation (6) versus the weight $\alpha_1$. For the
HEART and ADULT data sets, the objective function is minimized by the weights (0.12, 0.88) and
(0.11, 0.89), respectively.

For the HEART (resp. ADULT) data, FIG. 2C (resp. FIG. 2D) shows a plot of
macro-p (resp. micro-p, macro-p, and macro-r) versus the weight $\alpha_1$. By comparing
FIG. 2A with FIG. 2C, and FIG. 2B with FIG. 2D, it can be seen that, roughly, macro-p (resp.
micro-p, macro-p, and macro-r) are negatively correlated with the objective function
$Q_1 \times Q_2$ and that, in fact, the optimal weight tuples achieve nearly optimal precision
and recall. In conclusion, optimizing the objective function $Q_1 \times Q_2$ leads,
reassuringly, to optimizing the precision/recall performance, and thus leads to good
clusterings and a final solution.

7. Second Example of Using the Method of Clustering Text Data using
Words and Phrases as the Feature Spaces

7.1 Phrases in Information Retrieval: Vector space models represent each document
as a vector of certain (possibly weighted and normalized) term frequencies. Typically, terms are
single words. However, to capture word ordering, it is intuitive to also include multi-word
sequences, namely, phrases, as terms. The use of phrases as terms in vector space models has
been well studied. In the example that follows, the phrases are not clubbed along with the
words in a single vector space model. For information retrieval, when single words are also
simultaneously used, it is known that natural language phrases do not perform significantly
better than statistical phrases. Hence, the focus is on statistical phrases, which are simpler
to extract; see, for example, Agrawal et al., "Mining sequential patterns," Proc. Int. Conf.
Data Eng., (1995). Also, Mladenic et al., "Word sequences as features in text-learning," in
Proc. 7th Electrotech Computer Science Conference, Ljubljana, Slovenia, pages 145-148, (1998),
found that while adding 2- and 3-word phrases improved the classifier performance, longer
phrases did not. Hence, the example illustrates single words, 2-word phrases and 3-word phrases.

7.2 Data Model: FIG. 3 shows the feasible weights for the second exemplary use of the
invention wherein, when $m = 3$, $\Delta$ is the triangular region formed by the intersection of
the plane $\alpha_1 + \alpha_2 + \alpha_3 = 1$ with the nonnegative orthant of $\mathbb{R}^3$.
The left-vertex, the right-vertex, and the top-vertex of the triangle correspond to the points
(1, 0, 0), (0, 1, 0), and (0, 0, 1), respectively. Each document is represented as a triplet of
three vectors: a vector of word frequencies, a vector of 2-word phrase frequencies, and a vector
of 3-word phrase frequencies, that is, $x = (F_1, F_2, F_3)$. It is now shown how to compute such
representations for every document in a given corpus.

The creation of the first feature vector is a standard exercise in information retrieval. The
basic idea is to construct a word dictionary of all the words that appear in any of the
documents in the corpus, and to prune or eliminate stop words from this dictionary. For the
present application, also eliminated are those low-frequency words which appeared in less than
0.64% of the documents. Suppose $f_1$ unique words remain in the dictionary after such
elimination. Assign a unique identifier from 1 to $f_1$ to each of these words. Now, for each
document $x$ in the corpus, the first vector $F_1$ in the triplet will be an $f_1$-dimensional
vector. The $j$th column entry, $1 \le j \le f_1$, of $F_1$ is the number of occurrences of the
$j$th word in the document $x$.

Creation of the second (resp. third) feature vector is essentially the same as the first, except
that the low-frequency 2-word (resp. 3-word) phrase elimination threshold is set to one-half
(resp. one-quarter) of that for the words, because a 2-word (resp. 3-word) phrase is generally
less likely than a single word. Let $f_2$ (resp. $f_3$) denote the dimensionalities of the
second (resp. third) feature vector.

Finally, each of the three components $F_1$, $F_2$, $F_3$ is normalized to have a unit Euclidean
norm, that is, their directions are retained and their lengths are discarded. There are a large
number of term-weighting schemes in information retrieval for assigning different relative
importance to various terms in the feature vectors. These feature vectors correspond to a
popular scheme known as normalized term frequency. The distortion measures $D_1$, $D_2$, and
$D_3$ are the cosine distances.
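The construction of the three feature vectors can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the specification: the function names, the toy corpus, and the toy thresholds (4, 2, 1) are assumptions, stop-word removal is omitted, identifiers are 0-based rather than 1-based, and the cosine distance is taken as one minus the inner product of unit vectors, whose scaling may differ from the definition given elsewhere in the specification.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[Phrase]:
    """All contiguous n-word sequences of a tokenized document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(
    docs: Sequence[Sequence[str]], n: int, min_docs: int
) -> Dict[Phrase, int]:
    """Keep n-grams appearing in at least min_docs documents; assign ids 0..f-1."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens, n)))  # count each document once
    kept = sorted(g for g, df in doc_freq.items() if df >= min_docs)
    return {g: j for j, g in enumerate(kept)}


def feature_vector(
    tokens: Sequence[str], n: int, dictionary: Dict[Phrase, int]
) -> List[float]:
    """Normalized term-frequency vector; stays all-zero if nothing is retained."""
    vector = [0.0] * len(dictionary)
    for g in ngrams(tokens, n):
        j = dictionary.get(g)
        if j is not None:
            vector[j] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed form: 1 - <u, v> for unit-norm u and v."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


docs = [d.split() for d in (
    "data mining is fun",
    "data mining is hard",
    "text mining is fun",
    "data mining is fun today",
)]
thresholds = (4, 2, 1)  # toy values: words, one-half for 2-word, one-quarter for 3-word
dictionaries = [build_dictionary(docs, n, t) for n, t in zip((1, 2, 3), thresholds)]
triplets = [
    tuple(feature_vector(doc, n, dictionaries[n - 1]) for n in (1, 2, 3))
    for doc in docs
]
```

Each entry of `triplets` plays the role of $(F_1, F_2, F_3)$ for one document. A document with no retained phrase in some feature space keeps an all-zero vector there, which is why the number of documents with at least one entry can differ across the three spaces.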
7.3 Newsgroups Data: The following 10 newsgroups were picked out of the
"Newsgroups data" from a newspaper to illustrate the invention:

    misc.forsale    sci.crypt    comp.windows.x    comp.sys.mac.hardware
    rec.autos    rec.sport.baseball    soc.religion.christian
    sci.space    talk.politics.guns    talk.politics.mideast

Each newsgroup contains 1000 documents; after removing empty documents, a total of $n = 9961$
documents exist. For this data set, the unpruned word (resp. 2-word phrase and 3-word phrase)
dictionary had size 72586 (resp. 429604 and 461132), out of which $f_1 = 2583$ (resp.
$f_2 = 2144$ and $f_3 = 2268$) elements that appeared in at least 64 (resp. 32 and 16) documents
were retained. All three feature spaces were highly sparse; on average, after pruning, each
document had only 50 (resp. 7.19 and 8.34) words (resp. 2-word and 3-word phrases). Finally,
$n_1 = n = 9961$ (resp. $n_2 = 8639$ and $n_3 = 4664$) documents had at least one word (resp.
2-word phrase and 3-word phrase). Note that the numbers $n_1$, $n_2$, and $n_3$ are used in
equation (5) above.

7.4 The Optimal Weights: FIG. 4 shows, for the Newsgroups data set, a plot of macro-p
versus the objective function $Q_1 \times Q_2 \times Q_3$ for various different weight tuples.
The macro-p value corresponding to the optimal weight tuple is marked with its own symbol, and
the others are shown using a different symbol. The negative correlation between macro-p and the
objective function is evident from the plot. Macro-p, macro-r, and micro-p numbers are very
close to each other; thus, to avoid visual clutter, only macro-p numbers are plotted. In this
case, the set of feasible weights is the triangular region shown in FIG. 4. The $k$-means
algorithm was run with $k = 10$ on the Newsgroups data set with 31 different feature weights,
which are shown using the symbol • in FIG. 4. The objective function in equation (6) is
minimized by the weight tuple (0.50, 0.25, 0.25).

Further, FIG. 4 roughly shows that as the objective function decreases, macro-p increases. The
objective function value corresponding to the optimal weight tuple is, by definition, the
left-most point on the plot. It can be seen that the optimal weight tuple has a smaller macro-p
value than only one other weight tuple, namely, (.495, .010, .495). Although macro-r and micro-p
results are not shown, they lead to the same conclusions. The precision/recall numbers thus
indicate that optimal feature weighting provides good final data clustering.
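The specification does not state how the 31 candidate weight tuples were chosen. Purely as an illustration of evaluating an objective over candidate tuples in the triangular feasible region, the Python sketch below enumerates a regular grid on that region and picks the tuple with the smallest objective value. The names `simplex_grid` and `best_weight_tuple`, the grid resolution, and the toy objective (which is not equation (6)) are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

WeightTuple = Tuple[float, float, float]


def simplex_grid(steps: int) -> List[WeightTuple]:
    """All nonnegative (a1, a2, a3) that are multiples of 1/steps and sum to 1."""
    return [
        (i / steps, j / steps, (steps - i - j) / steps)
        for i in range(steps + 1)
        for j in range(steps + 1 - i)
    ]


def best_weight_tuple(
    objective: Callable[[WeightTuple], float], candidates: Sequence[WeightTuple]
) -> WeightTuple:
    """Return the candidate weight tuple with the smallest objective value."""
    return min(candidates, key=objective)


def toy_objective(alpha: WeightTuple) -> float:
    """Toy stand-in minimized at (0.25, 0.5, 0.25); NOT equation (6)."""
    target = (0.25, 0.5, 0.25)
    return sum((a - t) ** 2 for a, t in zip(alpha, target))


grid = simplex_grid(4)
best = best_weight_tuple(toy_objective, grid)
```

In practice each objective evaluation would require one full clustering run per candidate tuple, so a coarse grid keeps the number of runs small.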

As discussed above, it has been assumed that the number of clusters $k$ is given;
however, an important problem that the invention can be used for is to automatically determine
the number of clusters in an adaptive or data-driven fashion using information-theoretic
criteria such as the MDL principle. A computationally efficient gradient descent procedure for
computing the optimal parameter tuple $\alpha^\dagger$ can be used. This entails combining the
optimization problems in equation (6) and in equation (3) into a single problem that can be
solved using iterative gradient descent heuristics for the squared-Euclidean distance. In the
method of the invention, the new weighted distortion measure $D^\alpha$ has been used in the
$k$-means algorithm; it may also be possible to use this weighted distortion with a graph-based
algorithm such as the complete link method or with hierarchical agglomerative clustering
algorithms.

In summary, the invention provides a method for obtaining good data clustering by
integrating multiple, heterogeneous feature spaces in the $k$-means clustering algorithm by
adaptively selecting relative weights assigned to the various feature spaces while
simultaneously attaining good separation along all the feature spaces.

Moreover, FIG. 5 illustrates a preferred method of the invention fully described above,
where a method for evaluating and outputting a clustering solution for a plurality of
multi-dimensional data records comprises defining (110) a distortion between two feature vectors
as a weighted sum of distortion measures on components of the feature vectors; clustering (112)
the multi-dimensional data records into k-clusters using a convex programming formulation; and
selecting (114) optimal feature weights of the feature vectors by an objective function to
produce a solution of a final clustering that simultaneously minimizes average intra-cluster
dispersion and maximizes average inter-cluster dispersion along all the feature spaces.
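Step (110) and the assignment half of step (112) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed method: the function names, the toy records and centroids, and the example weights are introduced for illustration, the cosine distance is taken as one minus the inner product of unit vectors, and the centroid update and the weight selection of step (114) are omitted.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Record = Tuple[Vector, ...]  # one feature vector per feature space
Measure = Callable[[Vector, Vector], float]


def squared_euclidean(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_distance(u: Vector, v: Vector) -> float:
    """Assumed form: 1 - <u, v> for unit-norm u and v."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def weighted_distortion(
    x: Record, y: Record, weights: Sequence[float], measures: Sequence[Measure]
) -> float:
    """Weighted sum of per-feature-space distortions between two records."""
    return sum(w * d(fx, fy) for w, d, fx, fy in zip(weights, measures, x, y))


def assign_to_clusters(
    records: Sequence[Record],
    centroids: Sequence[Record],
    weights: Sequence[float],
    measures: Sequence[Measure],
) -> List[int]:
    """Assign each record to the centroid with the smallest weighted distortion."""
    return [
        min(
            range(len(centroids)),
            key=lambda j: weighted_distortion(x, centroids[j], weights, measures),
        )
        for x in records
    ]


# Toy records: (numerical feature vector, unit-norm categorical feature vector).
records = [([0.0, 0.0], [1.0, 0.0]), ([1.0, 1.0], [0.0, 1.0])]
centroids = [([0.1, 0.0], [1.0, 0.0]), ([0.9, 1.0], [0.0, 1.0])]
weights = (0.12, 0.88)  # example weights only
measures = (squared_euclidean, cosine_distance)
labels = assign_to_clusters(records, centroids, weights, measures)
```

Keeping one distortion measure per feature space lets each space use the distance suited to it, while the weights control how much each space contributes to the assignment.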

While the preferred embodiment of the present invention has been illustrated in detail, it
should be apparent that modifications and adaptations to that embodiment may occur to one
skilled in the art without departing from the spirit or scope of the present invention as set
forth in the following claims.